-
Notifications
You must be signed in to change notification settings - Fork 617
fix(instrumentation-runtime-node): use absolute results in eventLoopUtilization computation #3118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3ac9601 to
8d0191d
Compare
|
@d4nyll CCLA Signed. Sorry for the delay. Note : I'm available for discussing it upon need, ideally during Europe business hours, but can arrange otherwise. |
|
Hey @mveroone, thank you for raising the issue and apoloigies for taking my time on it. It is indeed a big bug. In your fix, there are areas of code for which ELU metrics being reported won't take into account: const elu = eventLoopUtilizationCollector(this._lastValue);
// From here
observableResult.observe(elu.utilization);
this._lastValue = elu;
// To here
this._lastValue = eventLoopUtilizationCollector();Whilst it's not such a big deal (it's only the timespan of running those two lines), it would be preferable to leave no time gaps with what's being captured. Calling What do you think about this implementation instead? const currentELU = eventLoopUtilizationCollector();
const deltaELU = eventLoopUtilizationCollector(currentELU, this._lastValue);
this._lastValue = currentELU;
observableResult.observe(deltaELU.utilization);It will:
|
|
Hey @d4nyll , Your solution also has the advantage of being way more self-explaining and should likely confuse future readers less than the previous version. I took the liberty to commit your suggestion, hope that's fine by you ? |
|
@mveroone Hey! On my local branch I added the following test to it('should correctly calculate utilization deltas across multiple measurements', async function () {
// This test ensures the bug where delta of deltas was observed instead of deltas of absolute values
// does not regress. See https://github.com/open-telemetry/opentelemetry-js-contrib/pull/3118
// This bug would surface on the third callback invocation.
const instrumentation = new RuntimeNodeInstrumentation({});
instrumentation.setMeterProvider(meterProvider);
// Helper function to create blocking work that results in high utilization
const createBlockingWork = (durationMs: number) => {
const start = Date.now();
while (Date.now() - start < durationMs) {
// Busy wait to block the event loop
}
};
// Helper function to collect metrics and extract utilization value
const collectUtilization = async (): Promise<number> => {
const { resourceMetrics } = await metricReader.collect();
const scopeMetrics = resourceMetrics.scopeMetrics;
const utilizationMetric = scopeMetrics[0].metrics.find(
x => x.descriptor.name === METRIC_NODEJS_EVENTLOOP_UTILIZATION
);
assert.notEqual(utilizationMetric, undefined, 'metric not found');
assert.strictEqual(utilizationMetric!.dataPoints.length, 1, 'expected one data point');
return utilizationMetric!.dataPoints[0].value as number;
};
// Wait for some time to establish baseline utilization
await new Promise(resolve => setTimeout(resolve, 200));
// First collection
const firstUtilization = await collectUtilization();
assert.notStrictEqual(firstUtilization, 1, 'Expected utilization in first measurement to be not 1');
// Second measurement: Create blocking work and measure
createBlockingWork(50);
const secondUtilization = await collectUtilization();
assert.strictEqual(secondUtilization, 1, 'Expected utilization in second measurement to be 1');
// Third measurement: Create blocking work again and measure
// This is where the bug would manifest - if we were observing delta of deltas,
// this measurement would not be 1
createBlockingWork(50);
const thirdUtilization = await collectUtilization();
assert.strictEqual(thirdUtilization, 1, 'Expected utilization in third measurement to be 1');
// Fourth measurement (should be the same as the third measurement, just a sanity check)
createBlockingWork(50);
const fourthUtilization = await collectUtilization();
assert.strictEqual(fourthUtilization, 1, 'Expected utilization in fourth measurement to be 1');
// Fifth measurement: Do some NON-blocking work (sanity check, should be low)
await new Promise(resolve => setTimeout(resolve, 50));
const fifthUtilization = await collectUtilization();
assert.ok(fifthUtilization < 0.1, 'Expected utilization in fifth measurement to be less than 0.1');
});On close inspection, for my suggested code / your last commit (96ec7a4) to work on the first scrape, So I think the last thing we need to do here is:
I appreciate this PR is turning into a bigger one than you first imagine, but thanks for working with me to get this bug squashed! |
|
Thanks a lot @d4nyll
No problem, I too prefer we arrive at the right robust and future-proof solution instead of a quick and simple one. Is that fine if I commit both of your suggestions myself ? I'm not accustomed to this community's traditions, and I definitely wouldn't want to rob your of attribution for your work. EDIT : as per our private conversation, you're welcome to send a PR with the above suggestions against this branch and i'll gladly review it to the best of my ability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @mveroone I think it looks good. Just need to update the branch with the main branch, fix any conflicts, run the tests again (just to make sure) and we should be good 🙏
packages/instrumentation-runtime-node/test/event_loop_utilization.test.ts
Outdated
Show resolved
Hide resolved
packages/instrumentation-runtime-node/src/metrics/eventLoopUtilizationCollector.ts
Outdated
Show resolved
Hide resolved
95cff5d to
8398e8f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this !
|
I see the unit test failing for Node.js v18 with |
packages/instrumentation-runtime-node/test/event_loop_utilization.test.ts
Show resolved
Hide resolved
…ion.test.ts Co-authored-by: Daniel Li <[email protected]>
I have been trying to reproduce but haven't had any chance to. Either by emulating GHA tests with Could it depend on the parallelization of tests by |
With a5fe644 it would be very unlikely to be flakey as the event loop would need to be completely busy for the entire 50ms we are waiting for it. @mveroone can you update the branch with the base branch and @david-luna would you be able to run the tests again afterwards? |
Which problem is this PR solving?
The current implementation of the Event Loop Utilization passes a delta value to the call instead of an absolute one.
The NodeJS perf_hooks documentation is a little ambiguous but it does say that if calling
eventLoopUtilization()with 1 argument, it should be the result of a call to that same function without argument :(Emphasis is mine)
The result of this bug is that the value tends to stabilize over time because we pass a diff of a diff of a diff and we tend to just return the value since the start of the process instead of a delta since last execution
Short description of the changes
Replaced the setting of the
lastValueinternal variable with a call to the argument-less perf_hook.Given that this only queries internal counters, I believe it's light enough that we can afford to call it twice per tick. The alternative would be to bypass the auto-calculation of the delteas provided as a helper and perform calculation of the ratio ourselves with a couple arithmetic operations.
Note
As a reference, Datadog library fixed the same bug last month, but they chose to disregard nodejs autocalculation of the utilization ratio and just do it themselves
DataDog/dd-trace-js#6344
(line 259 in the new version of the file, search for "elu" if needed)